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ABSTRACT 

Motivation: Capillary electrophoresis (CE) of nucleic acids is a 
workhorse technology underlying high-throughput genome analysis 
and large-scale chemical mapping for nucleic acid structural 
inference. Despite the wide availability of CE-based instruments, 
there remain challenges in leveraging their full power for quantitative 
analysis of RNA and DNA structure, thermodynamics, and kinetics. 
In particular, the slow rate and poor automation of available analysis 
tools have bottlenecked a new generation of studies involving 
hundreds of CE profiles per experiment. 

Results: We propose a computational method called high-throughput 
robust analysis for capillary electrophoresis (HiTRACE) to automate 
the key tasks in large-scale nucleic acid CE analysis, including the 
profile alignment that has heretofore been a rate-limiting step in 
the highest throughput experiments. We illustrate the application 
of HiTRACE on thirteen data sets representing 4 different RNAs, 
three chemical modification strategies, and up to 480 single 
mutant variants; the largest data sets each include 87,360 bands. 
By applying a series of robust dynamic programming algorithms, 
HiTRACE outperforms prior tools in terms of alignment and fitting 
quality, as assessed by measures including the correlation between 
quantified band intensities between replicate data sets. Furthermore, 
while the smallest of these data sets required 7 to 10 hours of 
manual intervention using prior approaches, HiTRACE quantitation of 
even the largest data sets herein was achieved in 3 to 12 minutes. 
The HiTRACE method therefore resolves a critical barrier to the 
efficient and accurate analysis of nucleic acid structure in experiments 
involving tens of thousands of electrophoretic bands. 
Availability: HiTRACE is freely available for download at 
http://hitrace.stanford.edu. 
Contact: sryoon(g)korea. ac.kr, rhiju@stanford.edu| 



1 INTRODUCTION 

Capillary electrophoresis (CE) is a widely used approach for 
biochemical analysis. The rapid electrophoretic separation of 
fluorescently labeled nucleic acid fragments inside electrolyte- 
filled capillaries significantly accelerated genome sequencing (RuizJ 
jMartinez et fl/.||1993]|Woolley and Mathiesl|1995| ). A more recent, 
powerful application of CE enables the high-throughput structure 
analysis of self- assembling nucleic-acid-containing systems fMitraj 



et al 2008, Vasa et al 2008, Weeks, 2010, Das et al 2010 
Kladwang and Das,^2010j as complex as v iruses (Watts a/.^ 2009 



Wilkinson et al.\ [2008 1 ) and ribosomes ( [Deigan et al.\ |2009| ) at 



single-nucleotide resolution. 

The CE profiles obtained in this recent generation of 'structure- 
mapping' experiments present tens of thousands of individual 
electrophoretic bands; quantifying these data gives detailed portraits 
of nucleic acid structure, folding thermodynamics, and kinetics but 
requires significant informatics efforts ( |Mitra et 'all |2008| ). 'Base- 
calling' software packages can assign sequences to these bands 
in special four-color experiments (see, e.g., [Ewing et al.\ |1998] 
|Ewing and Green|[l998| ) but are not applicable to structure mapping 
experiments, which require more robust sequence annotation and 
quantitative fits of each profile to a sum of peak shapes. Such 
quantitative analysis is aided by the design of experiments so that the 
desired information appears as differences between corresponding 
bands across profiles [see, e.g., ( |Das et aT\ |2005[ [Kladwang and| 
|Das| |2010| )]; then, sequence annotation of one profile results in 
annotation of corresponding bands across the entire data. For these 
data sets, tools for alignment of features, or 'rectification' ( Das] 
[gToI] [20051 [Laederach efal] [2008 ):), across different profiles 
resulted in improvements in quantification speed and accuracy, 
but these tools remain poorly automated. As the experimental 
steps of large-scale CE measurements continue to accelerate, the 
bioinformatic task of profile alignment has become a rate-limiting 
step in carrying out these information-rich structural studies. 

Current approaches to aligning and fitting capillary profiles 
include capillary automated footprinting analysis (CAFA; |Mitra| 
gF^T] [2008} and ShapeFinder < |Vasa et a/.|[2058l »; we have found 
these methods difficult to apply to large-scale titration or mutate- 
and-map data sets ( [Kladwang and Das] [2010[ [Kladwang et al.\ 
[201 1 ). For instance, CAFA is focused more on peak fitting and has 
limited alignment capabilities. The ShapeFinder alignment function 
can align spectrally separated products within a single capillary but 
not profiles across multiple capillaries with initially poor alignment. 
Use of these tools requires tedious manual intervention and risks 
bias or unnecessary errors from such manipulation. Analysis tools 
for alignment and peak fitting have also been proposed in other 
domains such as chromatograph y ([Nielsen et al. 



2005 



T9981 [Tomasi[ 



Kazmi et al. 



*to whom correspondence should be addressed 



et a/.[[2004| , mass spectrometry ( Wong et al 

2006| ) and slab gel electrophoresis ( Das et aT] " 2005 [Laederach] 
et a l. 2008 ), but, empirically, these approaches give unsatisfactory 
performance for CE data. 
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Fig. 1. Overview of the proposed HiTRACE methodology. (A) Raw electropherograms for an example data set. (left) Dimethyl sulfate (DMS) modification of 
the MedLoop RNA ( Kladwang et a/. "201 11, read out by reverse transcription with rhodamine-green-labeled primers followed by DNA separation by capillary 
electrophoresis; data shown are DMS profiles for 80 (of 120) single nucleotide mutants, two replicate controls without chemical modification, and sequencing 
ladders for C, U, and G. (right) Electropherograms of the Texas-red-labeled DNA ladder that was co-loaded with each sample to produce fiducial markers for 
alignment. (B) Profiles after automated preprocessing (baseline subtraction) and correlation- optimized linear alignment. (C) Profiles after automated alignment 
refinement by dynamic-programming-based nonlinear adjustments. (D) Interactive sequence annotation guided by features (red circles) at mutation positions 
and bands in the sequencing ladder. Blue, cyan, orange, and red lines correspond to modifications at A, C, G, and U, respectively. (E) Quantitated band areas 
(the final output of HiTRACE). Shorter DNA fragments (higher mobility) are at the top of each panel. 
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To address the limitations of existing methods, we have developed 
high-throughput robust analysis for capillary electrophoresis 
(HiTRACE) to automate the alignment and quantification of 
nucleic acid structure mapping profiles obtained from hundreds 
of capillaries. As depicted in Figure [T] the proposed method 
consists of four major steps: preprocessing (step A), correlation- 
optimized linear alignment (step B), dynamic -programming-based 
nonlinear adjustments (step C), sequence annotation (step D), and 
peak fitting (step E). After describing the core algorithms that 
underlie the robust automation of each step, we present quantitative 
comparisons illustrating the substantial boosts in both accuracy and 
speed of HiTRACE over previous approaches. With the proposed 
methodology, the previously rate-limiting step of quantifying high- 
throughput CE data is now faster than experimental data acquisition 
times, enabling the investigation of nucleic acid structure at an 
unprecedented rate. 



2 METHODS 

2.1 Experimental setup 

An experimental protocol that is optimal for HiTRACE alignment and 
quantification has been developed; complete descriptions of reaction 
components, purification procedures, and sequencing ladder generation 
have been given previously jPas et aL\ |2010 , Kladwang and Das] |2Q10| 
[Kladwang et al.\ |2011) . Briefly, RNA samples were chemically modified 
under the desired solution conditions and then reverse transcribed with 
primers (labeled at the 5^ ends with the rhodamine green fluorophore) 
complementary to the 3' end of the RNA. Because the reverse transcription 
stops at modified nucleotides, the length distribution of the resulting DNA 
products encodes the chemical reactivities of the RNA. Length separation 
of the DNA was carried out on Applied Biosystems ABI 3100 and ABI 
3730 sequencers; these intruments permit the single-nucleotide separation of 
products as long as 500 nucleotides for 16 and 96 samples, respectively. To 
facilitate HiTRACE alignment, all samples were co-loaded with a reference 
ladder that fluoresces in a different color and provides fiducial markers that 
are identical between samples. The ladder, prepared in a large batch for many 
experiments, was derived by reverse transcribing an arbitrary RNA (typically 
the 202-nucleotide P4-P6 RNA) with a Texas -red-labeled primer. 

2.2 Assumptions and definitions 

CE profiles each contain hundreds of 'bands' (when the data are viewed 
in gray scale) or 'peaks' (when the intensity is plotted as a function 
of electrophoresis time) whose intensities or areas report on individual 
residues of a nucleic acid sequence. In CE experiments that use hundreds of 
capillaries, profiles are typically obtained in multiple batches of experiments, 
e.g., with 16 capillaries in an ABI 3100 sequencer, as illustrated in Figure[2 
The first profile of each batch is designated the reference to which other 
profiles of the batch should be aligned. Each profile i represents fluorescence 
intensity measured at uniformly spaced time points (here, 0.1 seconds) 
denoted byn = 1,2,...,A^ with associated intensity values yi(n). As 
shown in Figure the horizontal and vertical axes correspond to the 
profile index and the measurement position in timepoints, respectively. 
Fluorescence intensity levels are represented in gray scale, with nucleic acid 
species of different lengths appearing as separated, dark bands. The desired 
final output of the proposed methodology is a set of aligned profiles with 
their quantified band areas. 

2.3 Preprocessing (step A) 

In a typical profile, the starting and ending regions contain no signal. To 
accelerate subsequent steps, the user has the option of defining a window 
that brackets the electrophoretic signals in all the profiles. As another 



preprocessing step, we subtract an offset, constant within each profile, so 
as to bring the signal to zero at the boundaries of the window; this step 
corrects for overall drift in signal baselines that are observed in sequencer 
detectors. We have also implemented an option to derive and subtract a 
smooth (but not necessarily linear) baseUne from each profile by using 
a procedure similar to ^X i and Rocke| j2008) . This operation removes 
smoothly varying backgrounds in fluorescence signal sporadically seen in 
experimental CE profiles and, empirically, brings independent replicates into 
closer agreement. 

2.4 Alignment by linear transformation (step B.l) 

The first step involves a linear scaling and shifting of the time axis based 
on maximizing the correlation coefficient between each fluorescence profile 
Hi (n) and the reference profile yi (n) within each batch: 



(A*,cr*)= argmax < corr 



(1) 



where Di and Si represent the sets of possible values of the shift and 
scale factor cr^, respectively, and Di x Si denotes their Cartesian product. 
Based on the values found above, we first time-scale each profile yi(n) 
by (J* using linear interpolation and then shift it by A*. The correlation 
coefficient was chosen as the optimization target because it is independent of 
signal offset and scaling and has been widely used in other alignment tasks 
(Nielsen et al. 1998 Bylund et al. 2002 Pravdova et al. 2002 Tomasi| 
et al . 2004). We carry out the search over shifts (A^) efficiently through a 
Fast Fourier Transform (FFT; |Oppenheim and Schafer 2009). By default, 
we carry out the alignment based on the reference ladder pattern that is 
co-loaded with each sample (see above). 

2.5 Alignment between batches (step B.2) 

Due to variabilities between batches, performing only the intra-batch 
alignment above produces stratified alignment results, where a number 
of up-and-down 'stairs' appear. To resolve this problem, we perform an 
additional inter-batcti alignment.This step constructs a representative profile 
of each batch by calculating the average of the first, middle, and last 
profiles selected from each batch. We align these representative profiles to 
the first representative profile by the procedure above. Assume that, for the 
representative profile from batch b, we have determined A^ and values. 
We then re-align all the profiles in batch b using these A^ and crj values (see 
Figure[T^). More details of step B.2 can be found in the supplement. 

2.6 Nonlinear alignment (step C) 

With current CE equipment, we found that it was not feasible to get 
complete alignment for profiles with single-band resolution through just 
linear scaling of the time axis. There are two reasons for this problem. First, 
the electrophoretic mobilities of the same products in different capillaries, or 
for the same capillary used at different times, can vary due to temperature 
differences and geometry differences. As a result, long profiles, containing 
hundreds of bands measured over tens of minutes, can be aligned well over 
the initial part of the data (e.g., the first two minutes) or the final part of 
the data (e.g., the last two minutes), but both parts cannot be simultaneously 
aligned with a single linear transformation. Second, we often run capillary 
electrophoresis experiments for structure mapping of molecules with slightly 
different sequences, e.g., libraries of single-mutation constructs f Kladwang] 
and Das 2010 Kladwang et al. 2011). This leads to small perturbations in 
the band mobilities at the site of the mutation and requires a locally nonlinear 
transformation to permit alignment. To correct for both these issues, we 
perform another round of refining the alignment. 

The concept underlying the non-linear alignment is depicted in the 
supplement, and resembles the warping method presented in [Nielsen et al.\ 
(1998) for chromatographic data. We break the time axis of a non-reference 
profile into m-pixel windows and then shift each window boundary within 
a predefined range over the reference profile so as to maximize the 
correlation between profiles summed over all windows. We assume that the 
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Table 1. High-throughput RNA structure mapping data sets analyzed by 
HiTRACE. 



2.8 Band deconvolution and quantification (step E) 



Name 


# profiles 


# bands per profile 


# total bands 


X20/H20 DMS-P 


98 


40 


3920 


X20/H20 DMS-2^ 


88 


40 


3520 


MedLoop DMS-1^ 


120 


60 


7200 


MedLoop uNib-z 


13o 


OU 


o 1 An 
oloU 


MedLoop CMCT-1^ 


128 


60 


7680 


MedLoop CMCT-2^ 


120 


60 


7200 


SRP DMS-r 


88 


60 


5280 


SRP DMS-2^ 


96 


60 


5760 


SRP CMCT-1^ 


88 


60 


5280 


SRP CMCT-2^ 


88 


60 


5280 


P4-P6 DMS^ 


480 


182 


87360 


P4-P6 CMCT^ 


480 


182 


87360 


P4-P6 SHAPE^ 


480 


182 


87360 



° |Kladwang and Das||201Q| ^Kladwang gf"^|201l| ^This work. 
Abbreviations: SRP, signal recognition particle conserved domain; P4-P6, P4-P6 
domain of the Tetrahymena group 1 ribozyme; DMS, dimethyl sulfate; CMCT, 
l-cyclohexyl-3-(2-morpholinoethyl) carbodiimide metho-p-toluenesulfonate; 
SHAPE, selective hydroxyl acylation analyzed by primer extension. 



window ordering is preserved during alignment. The number of possible 
arrangements in this setup is large but can be enumerated efficiently by a 
dynamic programming (DP ) approach jCormen et a/.||2009[[Nrelsen et a/.] 
p^98| [Bylund et al\ |2002| [Robinson et a/.||2007t that recursively solves 
the problem for the first window, then the first two windows, etc. As in 
steps B.l & B.2, we accelerate the calculation by computing correlation 
coefficients through EET. Example results are shown in Figure The 
supplement includes a graphical description of determining the shift amount 
for each window edge for aligning two example profiles. 



2.7 Sequence annotation (step D) 

Each band in a fluorescence profile corresponds to a position in the nucleic 
acid sequence. Currently, we carry out sequence annotation interactively 
and manually, as this encourages visual inspection of the data and 
makes use of expert knowledge to ensure accurate annotation. This step 
is accelerated compared to prior approaches (Mitra et aL\ |2008| | Vasa| 
\et al \ |200 8) through visual feedback. Sequence assignments are made 
based on Sanger sequencing ladders included in the experiments; as the 
user makes assignments, 'guidemarks' appear at expected band positions 
(one residue longer than the corresponding position of modification, due 
to dideoxynucleotide incorporation). These guidemark positions can be 
visually confirmed or adjusted to overlay on experimental bands (see circles 
in Fig. [T]D). In addition, these guidemarks can be set to appear on A and 
C positions for dimethyl sulfate alkylation experiments ( Peatti e and Gilbert] 
[1980 Tijerina et al. 2007), as well as mutated positions in mutate-and-map 
experiments (Fig. ^3), which typically give visually distinct perturbations 
in chemical modification. These features provide cross-checks on the 
sequencing ladder that confirm accuracy. Due to the alignment of traces 
achieved in previous steps, sequence annotations need to only be provided 
once and are applicable to all traces. Automated annotation procedures 
are also being developed and will be incorporated in future versions of 
HiTRACE. 



In this last step of HiTRACE, we approximate each profile y(n) for n 
1, 2, . . . , by a sum f{n) of K Gaussian curves with the form 

K 

/W = '^^k exp 

fe=i 

such that the deviation defined by 



{n - iJ.k)'^ 



(2) 



N 



(3) 



is minimized. A^, /i^ and (7^ are the parameters that determine the 
amplitude, the center location and the width, respectively, of a peak 
modeled by a Gaussian. We find the optimal values of these parameters 
by a standard Levenberg-Marquardt optimization technique for least-square 
minimization ( Levenberg] |1944[ |Marquardt| |1963) , and report the area of 
each peak as the final output. 

2.9 Implementation and data preparation 

We implemented the proposed HiTRACE methodology in the MATLAB 
programming environment (The MathWorks, http://www.mathworks.com) 
and are making it freely available for download at http://hitrace.stanford.edu. 
For comparison with HiTRACE, we also prepared the implementations 
of the five different profile analysis algorithms: CAFA (|Mitra et ^ 
[2008) , Sha peFinder jVasa et al\ [2008) , msaUgn jKazmi et al\ [2006) , 
SpecAlign (Wong et a/.||2005) , and COW ( [Tomasi gra/.||2004) . We could 
not apply some methods to all situations due to their intrinsic limitations. 
For instance, the alignment feature of CAFA and ShapeFinder requires 
significant manual intervention to handle hundreds of profiles; we did not 
include ShapeFinder in the alignment result comparison. Similarly, msalign, 
SpecAlign, and COW can align profiles but do not carry out peak fitting. We 
thus excluded them in fitting result comparisons. 

2.10 Criteria for evaluating alignment results 

We applied two widely used mathematical criteria — the mean squared error 
(MSE; "Kay' "1993) of aUgned peak positions with respect to the reference 
peaks and the KuUback-Leibler (KL) divergence (Cover and Thomas ||2006) 
between reference and non-reference profiles. 

In MSE computation, we consider the position p of each peak in the 
reference profile as the true value being estimated, and use the position p 
of the aligned peak on a non-reference profile as the estimator of p. The 
MSE for the j-th reference peak pj is then 



1 ^ 

MSEJ=E[{p,-pJf] = -Y,{P^: 



(4) 



where L is the number of profiles in the data set used, and pij and pij 
represent the positions of the j-th reference peak and the peak on profile i 
that is aligned to pj , respectively. For the peak detection step involved in 
the MSE computation, we used the peak algorithm described by |Mitra et al. \ 
f2008), which is specifically designed for finding peaks in CE profiles and 
shows satisfactory performance for our purpose. 

To evaluate the alignment results from an information-theoretic 
perspective, without explicitly considering specific peaks or band positions, 
we utilized the KL divergence. We calculated the KL divergence between 
the reference profile yi (n) and a non-reference profile yi (n) as 



DKL{yi\\yz) = 2/1 (n) log 



(5) 



where is the number of pixels in each profile. We repeat this calculation 
for every reference and non-reference pair in a data set. Before computing 
KL divergence, intensity values were limited to two standard deviations 
above the mean to prevent KL divergence values from being dominated by 
strong bands at the beginning and end of each profile. 
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3 RESULTS 

3.1 High- throughput RNA structure mapping data sets 

To test HiTRACE, we collected 13 nucleic acid structure mapping 
experiments read out by capillary electrophoresis (Table [TJ. These 
data sets were diverse: probed molecules included artificial model 
systems (the MedLoop RNA and the X20/H20 RNA/DNA system) 
as well as natural structured RNAs (a conserved domain from 
the signal recognition particle and the P4-P6 domain of the 
Tetrahymena group I ribozyme), with lengths between 60 and 202 
nucleotides. Three common chemical modification strategies were 
represented in the data: dimethyl sulfate alkylation (Tijerina et a/J 
|2007j ), carbodiimide modification ( Walczak et al 1996 ), and 2' -OR 
acylation [the SHAPE strategy ( ^Merino et al 2005 )]. In addition, 
the data sets were challenging in their size. Three experiments each 
gave 182 bands over 480 electropherograms, for a total of 87,360 
bands per data set. Finally, to test the precision of quantification 
relative to other sources of error, five experiments were conducted 
twice by two independent researchers. Additional data sets were 
collected to confirm HiTRACE' s ability to quantify data for RNAs 
over 400 nucleotides in length (the L-21 Seal Tetrahymena group I 
ribozyme) and to compare overlapping SHAPE data derived from 
reverse transcription starting at different primers on the same RNA 
(the P4-P6 domain). Overall, these data sets provide a diverse and 
challenging benchmark of nucleic acid CE experiments at the large 
scale permitted by current high-throughput experimental protocols. 



3.2 Robust ahgnment of CE profiles 

As the most basic test, we first compared the alignment results 
of HiTRACE with previously available methodologies by visual 
inspection (Figure [2]). Prior to alignment, CE experiments gave 
initially poor alignments of DMS chemical mapping profiles for 
the 60-band MedLoop RNA and the 182-band P4-P6 RNA ("Raw" 
in Fig. |2]\ & B). Application of automated HiTRACE alignment 
aligns the strong bands across all profiles ("HiTRACE" in Fig. [2]\ 
& B; see also Fig. [T]\-C). In the alignment results produced by 
methods other than HiTRACE, profiles within each group tend 
to be reasonably aligned whereas profile groups are not well- 
aligned. We did not observe this 'stratification' problem in the 
HiTRACE result, mainly due to the inter-batch alignment step (B.2) 
used by HiTRACE. Additionally, comparing HiTRACE results 
with SpecAlign and CAFA results reveals the effectiveness of 
the HiTRACE nonlinear alignment step, which adapts alignment 
to weakly varying electrophoretic rates along the profile. In the 
alignment results produced by SpecAlign and CAFA, some parts 
of the profiles appear reasonably aligned, but the top (SpecAlign) 
or bottom (CAFA) portions are not well-aligned. For msalign and 
COW, this problem is much more noticeable. 

For more quantitative evaluation of profile alignments, we 
compared the different methodologies in terms of two mathematical 
criteria, mean squared error in peak position (MSE) and KL 
divergence between profiles. Figures|2]l! & D show the distributions 
of the average MSE and KL divergence values over the 13 data 
sets used for different algorithms. With respect to HiTRACE, the 
alternative methodologies produced poorer results, 1.73-3.09 and 
1.51-3.94 times higher median MSE and KL divergence values, 
respectively. 



A COW 

1-^^'^' HiTRACE msalian Sp&eAlign [lapyiDj CAFA 



^ 12L1L1 




d 5D 1M d so 100 D 9ft IDD D 5D D 5D Ida d Sd Idd 

D COW 

^ Raw HiTRAOE msalign SfMcAJign (loOnO) CAFA 





D I^dD -l-dD d 200 4dD d ZDQ 4da D ZDQ ^DQ d 2dd 4dd d 4dD 

Profile 




Raw HrTRACE msalign SpecAlign COW COW CAJ=A 




Raw HiTRACE msalign SpfrcAJign COW COW CAFA 



Fig. 2. Comparison of available alignment strategies for nucleic acid 
CE profiles. (A) Comparison of electrophoretic profiles of the 88-profile 
MedLoop DMS mutate-and-map data set (Kladwa ng et aP^^ll) (replicate 
2) before alignment and after alignment by HiTRAC E, msalign jKazmi| 
et al. 2006), SpecAlign (Wong et al. 2005 1, COW (To masi et al\\200A) 
and CAFA jMitra et a/.| |2008 i. (B) Alignment results for the 480-profile 
P4-P6 DMS data set. (C) Quantitative comparison of alignment results for 
all 13 data sets based on mean squared error (MSE; [Kay] 1 1993) of aligned 
peak positions with respect to the reference peaks. Red line is median value; 
box boundaries represent 75th and 25th percentiles; error bars represent the 
most extreme values whose distance from the box is less than 1.5 times 
the box length; + symbols are outliers beyond this range. (D) Quantitative 
comparison in terms of the KuUback-Leibler divergence ^Cover and Thomas] 
,2006j between reference and non-reference profiles. 
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3.3 Leveraging accurate alignments into accurate 
quantification 

To assess the accuracy of the entire quantification procedure, 
including ahgnment, sequence annotation, and band deconvolution, 
we compared final quantified results between HiTRACE and 
previously available software for RNA structure mapping CE data, 
using two MedLoop DMS mutate-and-map data sets ( [Kladwangl 
\et al\ [2011 ) (see also supplement for a comparison with the 
X20/H20 DMS data). Each set contained at least 120 profiles 
with 60 bands, for a total of 7200 data points per set. (Further 
comparisons between software packages were precluded by the 
difficulty of carrying out the analysis with prior software: 
ShapeFinder gave poor alignment even after several hours of manual 
intervention, and CAFA analysis required 10 hours of manual 
adjustment.) 

The MedLoop sets gave excellent Pearson correlation coefficients 
between band intensities quantified with HiTRACE to those 
quantified with CAFA (r of 0.979 and 0.965; Fig. [3]\&B), 
confirming the lack of any major systematic errors introduced by 
the HiTRACE method. We hypothesized that the small, residual 
variance between the methods might stem from user-introduced 
variation during alignment (CAFA) or sequence assignment of 
bands (in CAFA and HiTRACE). To test this hypothesis, we 
carried out replicate quantification of the same data sets; the second 
independent analysis gave values with correlation coefficient (r) to 
the first analysis of 0.987 and 0.989 (HiTRACE; Fig. [Sp & D) 
and 0.989 and 0.974 (CAFA; Fig. [3f & F). We conclude that 
any differences between HiTRACE and CAFA can be explained 
by imprecision (variance of 1.1-1.3% in HiTRACE and 1.1-2.6% 
in CAFA) introduced by users; this error is much smaller than 
variances arising from experimental error, as is discussed next. 

3.4 Consistency in band quantification between 
experimental replicates 

A stringent measure of the accuracy of an experiment and 
its analysis is the correlation of quantified intensities between 
independent replicates. The goodness of this correlation is 
determined by experimental factors, including small variations 
in sample purity, pipetting errors, temperature differences, and 
variable times of each experimental step, and is also sensitive to any 
uncertainties arising from the data analysis procedure. We compared 
correlation coefficients between separate independent replicates of 
the MedLoop DMS mutate-and-map experiments |Kladwang et al. \ 
[20TTT), quantified by both HiTRACE and CAFA (Fig. |§\&B). 
In both cases, the cross-replicate correlations (0.89-0.90) are 
significantly lower than the intra-replicate comparisons (0.97-0.99) 
above, verifying that variances in experimental procedures exceed 
any variances in the data quantification. 

The throughput of HiTRACE quantification enabled us to carry 
out this cross-replicate comparison for the additional replicate 
sets (see supplement) and to explore whether alternative data 
processing schemes might improve the precision of the HiTRACE 
quantification. We tested a computationally expensive band 
deconvolution procedure [previously used in SAFA ( |Das et al\ 
|2005j )] that optimized centers of fitted Gaussians for each individual 
profile. We observed indistinguishable cross-replicate correlation 
coefficients (Fig. |4p) with this procedure as compared to the the 
default HiTRACE method (no peak refinement). This comparison 
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Fig. 3. Quantification accuracy and precision for HiTRACE and CAFA. 
Correlation of HiTRACE and CAFA results for two MedLoop DMS data 
sets [(A) & (B)] confirms the absence of any systematic deviation between 
the two approaches. Precision of HiTRACE [(C) & (D)] is similar or better 
than CAFA [(E) & (F)], based on independent analyses of the same data set. 



further validated the high quality of the profile-to-profile alignment 
in earlier HiTRACE steps, and motivated our choice to make 
as the HiTRACE default the 10- to 100-fold faster band- 
deconvolution procedure without band position fitting. We observed 
similarly invariant or slightly worse correlation coefficients in 
experiments without the baseline subtraction procedure; with 
additional alignment steps of 'binarized' profiles; and with other 
methods to automatically refine band positions in each profile (see 
supplement). 

3.5 Reduced time demand of quantification 

Although HiTRACE relies on multiple steps for accurate analysis, 
the time demand of quantification by HiTRACE was considerably 
smaller (a few minutes) than the time required by prior informatic 
approaches as well as the time involved in preparing and obtaining 
the CE experiments (a few hours). Figure [5] shows the average 
running time of HiTRACE for different data sets, along with the 
breakdown of the running time. The largest data set (P4-P6 CMCT; 
480 profiles and 87360 bands) took approximately twelve minutes 
to quantify, and the smaller sets (88-136 profiles, 4000-8000 
bands) required three minutes or less. Overall, HiTRACE averaged 
1.58 seconds per profile from beginning (raw data load-in) to end 
(quantified band intensities). For the same data sets, the overall 
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X20/H20DMS Medloop DMS Medloop CMCT SRP DMS SRP CMCT 

Data 



Fig. 4. Correlation of results between experimental replicates for 
HiTRACE (A) and CAFA (B) for the MedLoop DMS mutate-and-map 
experiments fKladwang et a/. "201 V), and (C) for HiTRACE on five replicate 
data sets without (black bars) and with (white bars) optimization of Gaussian 
positions during band deconvolution. 



computational time of the tools for alignment only (i.e., msalign, 
SpecAlign, and COW) were between tens of minutes to two hours 
(without peak fitting) depending on the data size. As discussed 
above, CAFA and ShapeFinder, the previous full suites available 
for nucleic acid CE quantification, required even more time (hours 
for the smaller data sets, extrapolated to days or weeks for the 
larger sets). As shown in Figure [5] the HiTRACE time breakdown 
is similar for all data sets, except for the 480-profile data set (P4- 
P6 CMCT), in which later stages are lengthened by increasing 
the number of bands in each profile (200 residues in the P4-P6 
RNA, compared to under 100 residues for the other RNAs). We 
further used HiTRACE on data sets with longer RNAs (up to 400 
nucleotides) and reverse transcribing from primers in the middle of 
a long RNA; the HiTRACE procedure was readily applied to these 
data sets (see supplement), and, encouragingly, the time demand 
remained linear with the number of bands. 



4 DISCUSSION 

HiTRACE employs a series of automated techniques to control the 
high level of variability in parameters of CE systems and to resolve a 
key alignment bottleneck of modern nucleic acid structure mapping 
experiments. Several algorithmic advances are responsible for 
HiTRACE' s accuracy and speed, including dynamic programming 
strategies that have not been been previously considered in the 
field. Quantitative comparisons on large experimental data sets 
demonstrate the utility of a linear time-axis transformation used in 



- Total running time (HiTRACE) 

■ Initialization 

■ Loading files 

□ Automatic linear alignment (intra-batch) 

■ Inter-batch alignment and baseline subtraction 

□ Auto-refine alignment (by dynamic programming) 

□ Interactive band annotation 

□ Band deconvolution 

I Alternative method (CAFA) 



750.9 



748.2 - 



-746.3 



X20/H20 
DMS 
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DMS CMCT 



SRP SRP 
DMS CMCT 
Data 



P4P6 
DMS 



P4P6 
CMCT 



P4P6 
SHAPE 



Fig. 5. The average running time of HiTRACE on different data sets. The 
time was measured on a personal computer system equipped with a 2.66GHz 
Core 15 processor (4 cores; multithreading enabled) with 4GB RAM. 



globally aligning profiles as well as the importance of a nonlinear 
alignment procedure for resolving further unavoidable variations 
in elution rates along a capillary. In addition, an interactive 
band annotation interface increases user convenience and provides 
accurate starting positions for the subsequent quantification step. 
These improvements have brought down the overall analysis time 
of data sets with tens of thousands of electrophoretic bands 
from days to minutes. The largest time-savings of the method 
are on experiments in which the same RNA sequence is probed 
under a variety of solution conditions, chemical modifiers, kinetic 



timepoints or mutations [see, e.g., (|Das et al. 


|2010| Mitra 


et al.\ 20081 IWilkinson et |2008| |Weeks| 2010 | 


Oadwang and 


Das |2010 IKladwang et al. |2011|)]. Now, the slow step in these 



and other experiments is interactive band annotation, which takes 
minutes (Fig. |5]). As more automated band assignment methods 
are developed (R.D., unpubl. results; personal comm., P. Pang, M. 
Elazar, J.S. Glenn), we plan to incorporate them into this interface. 

Although we designed HiTRACE primarily for RNA chemical 
structure mapping, the principles and premises that underlie 
HiTRACE are general and can easily be modified for use in other 
types of experimental assays. To enhance the adoption of this 
tool, we have created a stand-alone version of HiTRACE with a 
graphical user interface. We are also making the source code freely 
available to encourage further innovation and incorporation of these 
algorithms into other laboratories' CE software suites. Beyond the 
data sets discussed herein, HiTRACE is continuously being used for 
other studies, totalling over 20,000 profiles (greater than 2 million 
bands) at the time of submission (unpubl. data, W.K., R.D.; see 
also http://rmdb.stanford.edu). Given its accuracy, robustness and 
efficiency, we expect that HiTRACE will become a valuable tool 
for nucleic acid experimentalists entering a high-throughput era of 
structural analysis. 
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SUPPLEMENT 

1 ADDITIONAL DETAILS AND EXAMPLES 

1.1 Step B.2 (inter-batch alignment) 

The alignment procedure in step B.l can be considered as an intra- 
batch step in that we separately align the fluorescence profiles 
in each batch without considering profiles in other batches. Due 
to variabilities between batches, performing only the intra-batch 
alignment above produces stratified alignment results, where a 
number of up-and-down 'stairs' appear. To resolve this problem, 
we perform an additional inter-batch alignment, as illustrated in 
Figure |6] 

1.2 Binarization-based alignment (step B.3; optional ) 

This step can be optionally applied as the last of the correlation 
optimized linear alignment steps; because it did not improve 
precision assessed in cross-replicate correlation experiments, it is 
not performed in the default HiTRACE workflow. After inter-batch 
alignment, this step performs a peak detection on each profile and 
then binarize the profile so that the intensity at a peak position 
is set to 1 and the rest is set to 0. We then align the binarized 
profiles as before and use the resulting scale and shift information 
for re-aligning the original, non-binary profiles. This has the effect 
of low-pass filtering ( O ppenheim and Schafer| [2009 ) to suppress 
high-frequency noise components and aligns only peaks within each 
profile; it can give an improvement in alignment near the top of 
the data where multiple intense electrophoretic products overlap. 
For the peak detection process, we found that any reasonable peak 
detection method can be employed; we utilized the one described 
in |Kim et a/.] ( [2009l ). 

1.3 Step C (nonlinear alignment) 

The concept underlying the non-linear alignment step is depicted in 
Figure [7]\. Figure |7p-C shows an example of determining the shift 
amount of each window edge for an actual fluorescence profile. 

1.4 Step D.2 (automated transfer of band annotation; 
optional) 

Each band in a fluorescence profile corresponds to a position in 
the nucleic acid sequence. Given an annotation of one reference 
profile, HiTRACE can automatically annotate the other fluorescence 
profiles using a dynamic programming approach similar to the 
Needleman-Wunsch algorithm f Needleman and Wunsch[ |1970| ). 
This procedure was used before the development of nonlinear 
dynamic-programming-based alignment (step C); at that time, the 
final alignment of profiles was poorer in quality. With the current 
software including non-linear alignment, automated transfer of 
band annotation does not improve precision assessed in cross- 
replicate correlation experiments, so it is not performed in the 
default HiTRACE workflow. Nevertheless, we briefly summarize 
the annotation transfer algorithm here, as it can be carried out in 
HiTRACE (as the guess -alLpeaks script), and appears useful in 
a partially developed strategy for automated sequence assignment 
(R.D., unpub. results). 

The procedure of transferring the annotation from the reference 
profile to all other profiles starts with identification of bands in 
each profile by a peak detector. Due to noise and imperfections in 
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m-pixel window 



Re-align 
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amounts for 
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Fig. 6. Inter-batch alignment (step B.2). A representative profile is 
constructed for each batch that has been aligned in step B.l. All 
representatives are collected and then aUgned by the intra-batch algorithm, 
as if these were from a single batch. The resulting scale and shift amounts 
for each batch are used for re-ahgning the batch. 



experiments and analysis, some of the automatically detected bands 
do not have a matching annotation (these are called inserts), whereas 
some bands assigned in the manual annotation do not correspond to 
any automatically detected bands (deletions). See Supplementary 
Figure 3 for an example. Transferring band annotations requires 
accurate identification of which bands are extraneous or missing 
in each non-reference profile, a task that we carry out through a 
dynamic programming strategy. 

Let sequence R = (ri, r2, . . . , r^, . . .) denote the manually 
annotated band positions (in pixels) in the reference. Similarly, 
given a profile to be aligned to the reference profile, let sequence 
A = (ai, a2, . . . , aj, . . .) denote the band locations. For dynamic 
programming, we build a score matrix F indexed by i and j (i for 
R and j for A), where the value F{i,j) indicates the score of the 
best alignment between the prefix (ri , r2 , . . . r^) of and the prefix 
(ai , a2, . . . , aj) of A. The matrix F can be filled recursively by the 
following formula 



F{i — 1, j — 1) + matchScore(i, j) 
F(i,j) — min <^ F{i — l,j) + deletionPenalty(i) 
^F{i^j — 1) + insert Penalty 



(6) 



after a trivial initialization process. Backtracking on the matrix F 
reveals the optimal assignment of automatically found bands to 
manual annotations (Figure [S]). For any deletions, we estimated 
band locations missing in the non-reference profile based on linear 
interpolation between the nearest bands that match in the reference 
and non-reference profile. 

We considered a few factors to define matchScore(i, j); these 
functional forms and presented parameter settings were defined 
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Fig. 7. Nonlinear alignment (step C). (A) We break the time axis of a non- 
reference profile into m-pixel windows and drift each window (in pixel units; 
each pixel is 0.1 seconds) within a predefined range over the reference profile 
to find the shift amount that maximizes the correlation of the window and the 
corresponding fragment of the reference profile. The 'slack' is the amount 
by which we can extend or shrink each boundary of a window with respect 
to m, the default window size. The 'max shift' is the largest difference 
possible between a window boundary in a non-reference profile and its 
corresponding boundary in the reference profile. The 'sliding range' is the 
search region in the reference profile over which we compare a window from 
a non-reference profile. We find the optimal shift amount of each window by 
dynamic programming (DP). (B) Example of the score matrix for DP-based 
profile alignment. For each window, we determine its optimal shift offset 
using this matrix. The objective function is the total correlation coefficient 
value accumulated over all windows. Shown is the matrix for aligning the 
first and the sixteenth profile of the Medloop CMCT data (replicate 2) 
described in the result section. We set the window size m to 100, producing 
30 windows in total. The maximum amount of shifts allowed was 100, and 
the slack size used was 10. (C) A matrix to show the possible shift offsets 
of each window. Red circles indicate the optimal offsets determined by the 
backtracking procedure in (B). 
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10 15 20 25 30 35 40 45 50 55 
Index j (peaks on non-reference lane) 

Fig. 8. Automated transfer of band annotation (step D.2; optional). (A) 
Some of the automatically found bands did not have a matching annotation 
in the reference profile. We name these extra bands inserts. We do not find 
some sites in the manual annotation via automated peak fitting, and call 
these missing bands (i.e., false negatives) deletions. Shown are the first and 
the 120-th profiles of the Medloop CMCT data (replicate 2) described in 
the result section. (B) Example of the score matrix F(i,j) for dynamic- 
programming-based transfer of band annotation of the 120-th profile shown 
in (A). The optimal band assignments found by backtracking are shown 
using red circles. 



based on empirical results on one large-scale data set (P4-P6 
SHAPE; see Table 1 in the main article); the other data sets present 
independent tests of these parameters. The match penalty is a 
weighted sum of four factors: 



matchScore(i, j) = ^^'^fc " niatchScorefc(i, j) 



(7) 



where Wk is the weight of factor k. First, we let the match penalty 
proportional to the distance between n and a^, penalizing distant 
matches. The first component is thus given by 



matchScorei 



(8) 



■'r t I Profile 70 pj.„ 1 



Data: X20/H20 DMS (replicate 1) 
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o 0.4 



mean: 0.9543 
median: 0.9646 
std: 0.0416 
max: 0.9943 
min: 0.7255 
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P< 0.001 




Fig. 9. Correlation between peak areas quantified by HiTRACE and 
CAFA ( Mitra et "aT] |2008) for an additional data set. (A) Each point 
on the plot indicates the correlation coefficient between the peak areas 
HiTRACE and CAFA computed for an identical profile. The data used was 
X20/H20 DMS (repHcate 1), which has 98 profiles. The time demands for 
quantification by HiTRACE and CAFA were approximately 4 minutes and 
7 hours, respectively. (B) The distribution of the correlation coefficients. 
The average value was high (0.9543), suggesting that the results obtained 
from HiTRACE and CAFA are highly correlated for this data set. (C) More 
detailed plots for four arbitrary profiles. 



where dR is the average distance between two adjacent reference 
peaks in sequence R. Second, we consider the degree of peak-to- 
peak separation as follows: 



matchScore2 (i, j) 



(9) 



where i* and j* represent the position of the previous matching pair. 
Third, we consider the difference between the intensity of and aj 
relative to the previous matching location: 



matchScores (i, j) 



log 



log 



I{aj 



(10) 



where /(•) represents the profile intensity. Lastly, we reward band 
assignments to points of greater intensity up to a point by defining 
the last component: 



matchScore4(i,i) = 1 — min 



(11) 



where Ir is the median intensity of the reference profile. The 
weights used are wi — 1,W2 = 4, 1(73 = 0.25 and W4 — 2. 

The insertPenalty was set to 1.5. To determine deletionPenalty(i) 
we used an expectation-maximization (EM) approach ("Bishop 
|2006 ). After a first run with an initial constant value (4.5) for 
deletion penalty, we carried out a second run with deletion penalties 
inversely proportional to deletion frequencies at each peak position 
i seen in the first run. 
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ABC 




ddATP - + ddATP - + 



Fig. 10. Rapid HiTRACE annotation for longer RNAs. (A) Capillary 
electrophero grams for an experiment probing the "unfolded" L-21 Seal 
ribozyme in 50 mM Na-HEPES, pH 8.0 (W.K., R.D., unpub. results). 
The fluorescence profiles (arbitrary units) are, from left to right, ddATP 
sequencing ladder, control reaction with no chemical modifier, and 
experiment with the NMIA reagent (for SHAPE acylation). (B) View in 
HiTRACE near the 'top' of the data; note guidemark symbols in ddATP 
ladder (C) View in HiTRACE near the 'bottom' of the data. 
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Fig. 13. Profiles from experimental replicates of Medloop DMS data. 
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Fig. 15. Profiles from experimental replicates of SRP DMS data. 
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Fig. 16. Profiles from experimental replicates of SRP CMCT data. 



16 



HiTRACE: High-throughput analysis for capiiiary eiectrophoresis 




Fig. 11. Rapid HiTRACE annotation for reverse transcription from primers 
internal to an RNA sequnce. Data were collected for the P4-P6 RNA with 
primers to (A) the middle of this RNAs sequence (position 170) and to 
(B) the RNAs 3' end (position 270). For both sets, the fluorescence data 
are, from left to right, ddATP sequencing ladder, control reaction with no 
chemical modifier, and experiment with the NMIA reagent (for SHAPE 
acylation). (C) Correlation between the independent data sets. 
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